Package: sklearn.preprocessing
Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn. They might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.
The function scale provides a quick and easy way to perform this operation on a single array-like dataset:
In [2]:
from sklearn import preprocessing
import numpy as np
X = np.array([[ 1., -1.,  2.],
              [ 2.,  0.,  0.],
              [ 0.,  1., -1.]])
X_scaled = preprocessing.scale(X)
X_scaled
Out[2]:
In [3]:
X_scaled.mean(axis=0)
Out[3]:
In [4]:
X_scaled.std(axis=0)
Out[4]:
The utility class StandardScaler implements the Transformer API to compute the mean and standard deviation on a training set, so that the same transformation can later be reapplied to the test set.
In [5]:
scaler = preprocessing.StandardScaler().fit(X)
scaler
Out[5]:
In [6]:
scaler.mean_
Out[6]:
In [7]:
scaler.scale_
Out[7]:
In [8]:
scaler.transform(X)
Out[8]:
In [9]:
scaler.transform([[-1., 1., 0.]])
Out[9]:
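The transformation learned by StandardScaler is simply subtraction of the stored mean_ followed by division by scale_. As a quick sanity check, a minimal sketch (reusing the X and scaler objects above) that reproduces scaler.transform(X) by hand:
X_manual = (X - scaler.mean_) / scaler.scale_  # manual standardization with the fitted attributes
np.allclose(X_manual, scaler.transform(X))     # expected: True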
In [10]:
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
min_max_scaler = preprocessing.MinMaxScaler()
X_train_minmax = min_max_scaler.fit_transform(X_train)
X_train_minmax
Out[10]:
In [11]:
X_test = np.array([[-3., -1., 4.]])
X_test_minmax = min_max_scaler.transform(X_test)
X_test_minmax
Out[11]:
In [12]:
min_max_scaler.scale_
Out[12]:
In [13]:
min_max_scaler.min_
Out[13]:
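Internally, MinMaxScaler rescales each feature to the default [0, 1] range as (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0)); the scale_ and min_ attributes shown above encode exactly this mapping, so transform is equivalent to X * scale_ + min_. A minimal sketch in plain NumPy, reusing X_train and X_train_minmax from above:
X_std = (X_train - X_train.min(axis=0)) / (X_train.max(axis=0) - X_train.min(axis=0))
np.allclose(X_std, X_train_minmax)                           # expected: True
np.allclose(X_train * min_max_scaler.scale_ + min_max_scaler.min_,
            X_train_minmax)                                  # expected: True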
MaxAbsScaler works in a very similar fashion, but scales the training data so that it lies within the range [-1, 1] by dividing each feature by its maximum absolute value. It is meant for data that is already centered at zero or for sparse data.
MaxAbsScaler and maxabs_scale were specifically designed for scaling sparse data, and are the recommended way to go about this.
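A minimal sketch of how MaxAbsScaler would be used on the toy data above (variable names here are illustrative):
max_abs_scaler = preprocessing.MaxAbsScaler()
X_train_maxabs = max_abs_scaler.fit_transform(X_train)  # each feature divided by its maximum absolute value
max_abs_scaler.scale_                                    # per-feature maximum absolute values learned from X_train
max_abs_scaler.transform(np.array([[-3., -1., 4.]]))     # the same scaling reapplied to unseen data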
...
If your data contains many outliers, scaling using the mean and variance of the data is likely not to work very well. In these cases, you can use robust_scale and RobustScaler as drop-in replacements instead.
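RobustScaler follows the same Transformer API but centers each feature on its median and scales by its interquartile range; a minimal sketch, reusing X_train from above:
robust_scaler = preprocessing.RobustScaler()
X_train_robust = robust_scaler.fit_transform(X_train)  # (X - median) / IQR, per feature
robust_scaler.center_                                   # per-feature medians of the training data
robust_scaler.scale_                                    # per-feature interquartile ranges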
...
...
Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.
The function normalize provides a quick and easy way to perform this operation on a single array-like dataset, using either the l1 or l2 norm:
In [16]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
X_normalized = preprocessing.normalize(X, norm='l2')
X_normalized
Out[16]:
normalize and Normalizer accept both dense array-like and sparse matrices from scipy.sparse as input.
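The Normalizer utility class implements the same operation through the Transformer API (fit does nothing here, since each sample is normalized independently of the others); a short sketch, reusing the X list above:
normalizer = preprocessing.Normalizer().fit(X)  # fit is a no-op for this stateless transformer
normalizer.transform(X)                          # row-wise l2 normalization, same result as normalize(X)
normalizer.transform([[-1., 1., 0.]])            # can be applied to new samples as well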
Feature binarization is the process of thresholding numerical features to get boolean values. ...
As with the Normalizer, the utility class Binarizer is meant to be used in the early stages of sklearn.pipeline.Pipeline.
In [17]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
binarizer = preprocessing.Binarizer().fit(X)  # fit does nothing
binarizer
Out[17]:
In [18]:
binarizer.transform(X)
Out[18]:
In [19]:
binarizer = preprocessing.Binarizer(threshold=1.1)
binarizer.transform(X)
Out[19]:
The preprocessing module provides a companion function binarize to be used when the transformer API is not necessary.
binarize and Binarizer accept both dense array-like and sparse matrices from scipy.sparse as input.
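A brief sketch of binarize on the same X and threshold used above:
preprocessing.binarize(X)                 # same result as Binarizer().transform(X)
preprocessing.binarize(X, threshold=1.1)  # the threshold argument behaves as it does for the class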
Categorical features are often encoded as integers; such an integer representation cannot be used directly with scikit-learn estimators, as these expect continuous input and would interpret the categories as being ordered, which is often not desired.
One possibility to convert categorical features to features that can be used with scikit-learn estimators is to use a one-of-K or one-hot encoding, which is implemented in OneHotEncoder. This estimator transforms each categorical feature with m possible values into m binary features, with only one active.
In [21]:
enc = preprocessing.OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
Out[21]:
In [22]:
enc.transform([[0, 1, 3]]).toarray()
Out[22]:
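If the data to transform may contain category values never seen during fit, the encoder can be told to ignore them rather than raise an error; a sketch, assuming a scikit-learn version in which handle_unknown='ignore' is supported:
enc = preprocessing.OneHotEncoder(handle_unknown='ignore')
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
enc.transform([[2, 1, 3]]).toarray()  # the unseen value 2 in the first feature encodes to all zeros for that block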